{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "75884451-1ef9-49ab-8561-4eebe036cb3f",
   "metadata": {},
   "source": [
    "# Supplementary Vignette 2\n",
    "\n",
    "## Example workflow for H&E images\n",
    "\n",
    "Here we demonstrate a typical workflow for preprocessing of H&E images, consisting of the following steps:\n",
    "\n",
    "1. Loading the raw image\n",
    "2. Defining a simple preprocessing pipeline for tissue detection\n",
    "3. Creating a PyTorch DataLoader for interfacing with any downstream machine learning model\n",
    "\n",
    "The image used in this example is publicly avilalable for download: http://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/\n",
    "\n",
    "**a. Load the image**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4fe03d4f-1e07-4afe-9ed2-dc027977b782",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathml.core import SlideData, types\n",
    "\n",
    "# load the image\n",
    "wsi = SlideData(\"../../data/CMU-1.svs\", name = \"example\", slide_type = types.HE)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd6e65ee-13b7-486b-85bf-b011a851f255",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "**b. Define a preprocessing pipeline**\n",
    "\n",
    "Pipelines are created by composing a sequence of modular transformations; in this example we apply a blur to reduce noise in the image followed by tissue detection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b94114e2-4384-4f6e-b0cd-14e4ebcb205c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathml.preprocessing import Pipeline, BoxBlur, TissueDetectionHE\n",
    "\n",
    "pipeline = Pipeline([\n",
    "    BoxBlur(kernel_size=15),\n",
    "    TissueDetectionHE(mask_name = \"tissue\", min_region_size=500, \n",
    "                      threshold=30, outer_contours_only=True)\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c5177e6-d7e5-4218-a150-362c05e35988",
   "metadata": {},
   "source": [
    "**c. Run preprocessing**\n",
    "\n",
    "Now that we have constructed our pipeline, we are ready to run it on our WSI.\n",
    "PathML supports distributed computing, speeding up processing by running tiles in parallel among many workers rather than processing each tile sequentially on a single worker. This is supported by [Dask.distributed](https://distributed.dask.org/en/latest/index.html) on the backend, and is highly scalable for very large datasets. \n",
    "\n",
    "The first step is to create a `Client` object. In this case, we will use a simple cluster running locally; however, Dask supports other setups including Kubernetes, SLURM, etc. See the [PathML documentation](https://pathml.readthedocs.io/en/latest/running_pipelines.html#distributed-processing) for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c7eb71bf-2189-41ee-9e51-194f9c655ab2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dask.distributed import Client, LocalCluster\n",
    "\n",
    "cluster = LocalCluster(n_workers=6)\n",
    "client = Client(cluster)\n",
    "\n",
    "wsi.run(pipeline, distributed=True, client=client);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "46aaf3f2-7ccf-4bcd-be64-888dafc6c096",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of tiles extracted: 150\n"
     ]
    }
   ],
   "source": [
    "print(f\"Total number of tiles extracted: {len(wsi.tiles)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4959af73-db3e-4c38-a7f3-1e6f854e50ac",
   "metadata": {},
   "source": [
    "**e. Save results to disk**\n",
    "\n",
    "The resulting preprocessed data is written to disk, leveraging the HDF5 data specification optimized for efficiently manipulating larger-than-memory data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "0abf4264-13f5-4264-b8f3-93bf97589b77",
   "metadata": {},
   "outputs": [],
   "source": [
    "wsi.write(\"./data/CMU-1-preprocessed.h5path\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad585896-b456-40fb-821e-dc45894519af",
   "metadata": {},
   "source": [
    "**f. Create PyTorch DataLoader**\n",
    "\n",
    "The `DataLoader` provides an interface with any machine learning model built on the PyTorch ecosystem"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "324b6286-c3d7-49ad-8a86-4f70998c4358",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathml.ml import TileDataset\n",
    "from torch.utils.data import DataLoader\n",
    "\n",
    "dataset = TileDataset(\"./data/CMU-1-preprocessed.h5path\")\n",
    "dataloader = DataLoader(dataset, batch_size = 16, num_workers = 4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f17c90a3-feed-48a4-97df-35c7ecd9a7e1",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Summary\n",
    "\n",
    "Here we demonstrate a complete `PathML` workflow for analyzing brightfield images:\n",
    "\n",
    "1. Loading the raw image\n",
    "2. Define a simple preprocessing pipeline for tissue detection\n",
    "3. Create a PyTorch DataLoader for interfacing with any downstream machine learning model\n",
    "\n",
    "Full documentation of the `PathML` API is available at https://pathml.org.  \n",
    "\n",
    "Full code for this vignette is available at https://github.com/Dana-Farber-AIOS/pathml/tree/master/examples/manuscript_vignettes_stable"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pathml",
   "language": "python",
   "name": "pathml"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
